Customer Personality Analysis is a detailed analysis of a company's ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different customer segments.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
import datetime
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
from sklearn import preprocessing
df=pd.read_csv('marketing_campaign.csv',sep='\t')
df.head(10)
|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5 | 7446 | 1967 | Master | Together | 62513.0 | 0 | 1 | 09-09-2013 | 16 | 520 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6 | 965 | 1971 | Graduation | Divorced | 55635.0 | 0 | 1 | 13-11-2012 | 34 | 235 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 7 | 6177 | 1985 | PhD | Married | 33454.0 | 1 | 0 | 08-05-2013 | 32 | 76 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 8 | 4855 | 1974 | PhD | Together | 30351.0 | 1 | 0 | 06-06-2013 | 19 | 14 | ... | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 9 | 5899 | 1950 | PhD | Together | 5648.0 | 1 | 1 | 13-03-2014 | 68 | 28 | ... | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
10 rows × 29 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
df.describe()
|   | ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.000000 | 2240.000000 | 2216.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | ... | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.0 | 2240.0 | 2240.000000 |
| mean | 5592.159821 | 1968.805804 | 52247.251354 | 0.444196 | 0.506250 | 49.109375 | 303.935714 | 26.302232 | 166.950000 | 37.525446 | ... | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| std | 3246.662198 | 11.984069 | 25173.076661 | 0.538398 | 0.544538 | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | ... | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| min | 0.000000 | 1893.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 25% | 2828.250000 | 1959.000000 | 35303.000000 | 0.000000 | 0.000000 | 24.000000 | 23.750000 | 1.000000 | 16.000000 | 3.000000 | ... | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 50% | 5458.500000 | 1970.000000 | 51381.500000 | 0.000000 | 0.000000 | 49.000000 | 173.500000 | 8.000000 | 67.000000 | 12.000000 | ... | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 75% | 8427.750000 | 1977.000000 | 68522.000000 | 1.000000 | 1.000000 | 74.000000 | 504.250000 | 33.000000 | 232.000000 | 50.000000 | ... | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| max | 11191.000000 | 1996.000000 | 666666.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.0 | 11.0 | 1.000000 |
8 rows × 26 columns
from dataprep.eda import plot, plot_correlation, create_report, plot_missing
plot(df)
| Number of Variables | 29 |
|---|---|
| Number of Rows | 2240 |
| Missing Cells | 24 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 883.0 KB |
| Average Row Size in Memory | 403.6 B |
| Insight | Type |
|---|---|
| MntFruits and MntSweetProducts have similar distributions | Similar Distribution |
| Income has 24 (1.07%) missing values | Missing |
| Income is skewed | Skewed |
| MntWines is skewed | Skewed |
| MntFruits is skewed | Skewed |
| MntMeatProducts is skewed | Skewed |
| MntFishProducts is skewed | Skewed |
| MntSweetProducts is skewed | Skewed |
| MntGoldProds is skewed | Skewed |
| NumDealsPurchases is skewed | Skewed |
| NumWebPurchases is skewed | Skewed |
| NumCatalogPurchases is skewed | Skewed |
| NumStorePurchases is skewed | Skewed |
| NumWebVisitsMonth is skewed | Skewed |
| Dt_Customer has a high cardinality: 663 distinct values | High Cardinality |
| Z_CostContact has constant value "3" | Constant |
| Z_Revenue has constant value "11" | Constant |
| Kidhome has constant length 1 | Constant Length |
| Teenhome has constant length 1 | Constant Length |
| Dt_Customer has constant length 10 | Constant Length |
| AcceptedCmp3 has constant length 1 | Constant Length |
| AcceptedCmp4 has constant length 1 | Constant Length |
| AcceptedCmp5 has constant length 1 | Constant Length |
| AcceptedCmp1 has constant length 1 | Constant Length |
| AcceptedCmp2 has constant length 1 | Constant Length |
| Complain has constant length 1 | Constant Length |
| Z_CostContact has constant length 1 | Constant Length |
| Z_Revenue has constant length 2 | Constant Length |
| Response has constant length 1 | Constant Length |
| MntFruits has 400 (17.86%) zeros | Zeros |
| MntFishProducts has 384 (17.14%) zeros | Zeros |
| MntSweetProducts has 419 (18.71%) zeros | Zeros |
| NumCatalogPurchases has 586 (26.16%) zeros | Zeros |
This dataset covers 2,240 customers: their basic demographic information, their product purchasing preferences, and their responses to several marketing campaigns. The Income column has 24 missing values.
The features fall into four groups: People, Products, Promotion, and Place.
df_copy=df.copy()
#df_copy[df_copy.Income.isna()]
# drop null value in income column
df_copy.dropna(subset=['Income'],inplace=True)
#Test
df_copy[df_copy.Income.isna()]
|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 29 columns
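Dropping the 24 rows with missing Income is one option; median imputation is a common alternative that keeps every customer. A minimal sketch on a toy Series (the sample values are borrowed from the head() preview, not the full column):

```python
import pandas as pd

# Sketch: fill missing Income with the column median instead of dropping rows.
income = pd.Series([58138.0, None, 71613.0, 26646.0])
filled = income.fillna(income.median())
print(filled.isna().sum())  # 0 missing values remain
```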
### Calculate Age from Year_Birth
year = datetime.datetime.today().year
df_copy['Age']=year-df_copy['Year_Birth']
df_copy['Age'].describe()
count    2216.000000
mean       52.179603
std        11.985554
min        25.000000
25%        44.000000
50%        51.000000
75%        62.000000
max       128.000000
Name: Age, dtype: float64
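Because Age is derived from datetime.datetime.today(), these values shift every time the notebook is rerun. A small sketch pinning the computation to a fixed snapshot year (REFERENCE_YEAR = 2021 is an assumption matching when this notebook ran) keeps results reproducible:

```python
import pandas as pd

# Sketch: compute Age against a fixed reference year rather than today's date.
REFERENCE_YEAR = 2021  # assumed snapshot year, for reproducibility

births = pd.Series([1957, 1954, 1965], name="Year_Birth")
ages = REFERENCE_YEAR - births
print(ages.tolist())  # [64, 67, 56]
```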
df_copy['ChildNumber']=df_copy['Kidhome']+df_copy['Teenhome']
df_copy['HaseChild']=df_copy['ChildNumber'].apply(lambda x: 1 if x>0 else 0)
df_copy.HaseChild.value_counts()
1    1583
0     633
Name: HaseChild, dtype: int64
df_copy.drop(columns=['Z_Revenue','Z_CostContact'],inplace=True)
df_copy['spent']=df_copy['MntWines']+df_copy['MntFruits']+df_copy['MntMeatProducts']+df_copy['MntFishProducts']+df_copy['MntSweetProducts']+df_copy['MntGoldProds']
df_copy.Education.value_counts()
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: Education, dtype: int64
df_copy.Education=df_copy.Education.apply(lambda x: 'UnGraduation' if x=='2n Cycle' or x=='Basic' else x )
df_copy.Education.value_counts()
Graduation      1116
PhD              481
Master           365
UnGraduation     254
Name: Education, dtype: int64
df_copy['Income']=df_copy['Income']/1000
df_copy.Marital_Status.value_counts()
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
YOLO          2
Absurd        2
Name: Marital_Status, dtype: int64
df_copy.Marital_Status=df_copy.Marital_Status.apply(lambda x: 'Single' if x=='Alone' or x=='Absurd' or x=='YOLO' else x )
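The chained if/else lambda works, but a mapping dict with Series.replace is easier to extend when folding rare levels into a larger one (the category names come from the value_counts above; the toy Series is illustrative):

```python
import pandas as pd

# Sketch: fold the rare marital-status labels into 'Single' via a mapping dict.
status = pd.Series(["Married", "Alone", "YOLO", "Absurd", "Widow"])
merge_map = {"Alone": "Single", "YOLO": "Single", "Absurd": "Single"}
print(status.replace(merge_map).tolist())
# ['Married', 'Single', 'Single', 'Single', 'Widow']
```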
df_copy.drop(df_copy[(df_copy['Income']>200)|(df_copy['Age']>100)].index,inplace=True)
le = preprocessing.LabelEncoder()
le.fit(df_copy.Education)
df_copy['EducationId']=le.transform(df_copy['Education'])
le2 = preprocessing.LabelEncoder()
le2.fit(df_copy.Marital_Status)
df_copy['Marital_StatusId']=le2.transform(df_copy['Marital_Status'])
print(le.inverse_transform([0]))
['Graduation']
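LabelEncoder assigns integer codes in alphabetical order of the class labels, which is why code 0 maps back to 'Graduation'. One caveat: KMeans will treat these codes as ordinal distances, so the spacing between categories is an artefact of the encoding. A quick illustration:

```python
from sklearn import preprocessing

# Sketch: LabelEncoder codes follow alphabetical order of the classes.
le = preprocessing.LabelEncoder()
le.fit(["Graduation", "Master", "PhD", "UnGraduation"])
print(list(le.classes_))      # ['Graduation', 'Master', 'PhD', 'UnGraduation']
print(le.transform(["PhD"]))  # [2]
```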
kmeans_model = KMeans(init='k-means++', max_iter=400, random_state=42)
intersetColumns=['spent','Age','EducationId','Marital_StatusId','HaseChild','Income','NumDealsPurchases','ChildNumber','Response']
kmeans_model.fit(df_copy[intersetColumns])
KMeans(max_iter=400, random_state=42)
def try_different_clusters(K, data):
cluster_values = list(range(1, K+1))
inertias=[]
for c in cluster_values:
model = KMeans(n_clusters = c,init='k-means++',max_iter=400,random_state=42)
model.fit(data)
inertias.append(model.inertia_)
return inertias
# Compute inertia for k values from 1 to 12
outputs = try_different_clusters(12, df_copy[intersetColumns])
distances = pd.DataFrame({"clusters": list(range(1, 13)),"sum of squared distances": outputs})
# Finding optimal number of clusters k
figure = go.Figure()
figure.add_trace(go.Scatter(x=distances["clusters"], y=distances["sum of squared distances"]))
figure.update_layout(xaxis = dict(tick0 = 1,dtick = 1,tickmode = 'linear'),
xaxis_title="Number of clusters",
yaxis_title="Sum of squared distances",
title_text="Finding optimal number of clusters using elbow method")
figure.show()
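The elbow plot can be ambiguous; silhouette scores are a complementary way to pick k (higher is better). A self-contained sketch on synthetic blobs, which stand in here for the notebook's feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sketch: pick k by maximizing the silhouette score on toy data.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print(max(scores, key=scores.get))  # 3, for three well-separated blobs
```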
kmeans_model_new = KMeans(n_clusters = 5,init='k-means++',random_state=42)
df_copy['Cluster']=kmeans_model_new.fit_predict(df_copy[['spent','Age','EducationId','HaseChild','Income','NumDealsPurchases']])
df_copy['Cluster'].value_counts()
2    1008
3     417
0     403
1     265
4     119
Name: Cluster, dtype: int64
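One caveat on the clustering above: 'spent' runs into the thousands while the encoded categorical columns are single digits, so unscaled Euclidean distances are dominated by the largest features. Standardizing first is a common alternative (not used above); a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sketch: standardize features so each contributes equally to KMeans distances.
X = np.array([[600.0, 52.0], [50.0, 30.0], [1500.0, 80.0]])  # toy [spent, Income] rows
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True: columns centred
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True: unit variance
```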
# visualize clusters
figure = px.scatter_3d(df_copy[['spent','Age','EducationId','HaseChild','Income','NumDealsPurchases','Cluster']],
color='Cluster',
x="Income",
y="Age",
z="spent",
category_orders = {"Cluster": ["0", "1", "2", "3", "4"]}
)
figure.update_layout()
figure.show()
ndata=df_copy[['spent','Age','EducationId','HaseChild','Income','NumDealsPurchases','Cluster']].copy()
df_copy.describe()
|   | ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | AcceptedCmp2 | Complain | Response | Age | ChildNumber | HaseChild | spent | EducationId | Marital_StatusId | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | ... | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 | 2212.000000 |
| mean | 5585.160940 | 1968.913653 | 51.958811 | 0.441682 | 0.505877 | 49.019439 | 305.287523 | 26.329566 | 167.029837 | 37.648734 | ... | 0.013562 | 0.009042 | 0.150542 | 52.086347 | 0.947559 | 0.714286 | 607.268083 | 0.940778 | 1.730561 | 1.811935 |
| std | 3247.523735 | 11.701599 | 21.527279 | 0.536955 | 0.544253 | 28.943121 | 337.322940 | 39.744052 | 224.254493 | 54.772033 | ... | 0.115691 | 0.094678 | 0.357683 | 11.701599 | 0.749466 | 0.451856 | 602.513364 | 1.083414 | 1.062373 | 1.103378 |
| min | 0.000000 | 1940.000000 | 1.730000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2814.750000 | 1959.000000 | 35.233500 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 44.000000 | 0.000000 | 0.000000 | 69.000000 | 0.000000 | 1.000000 | 1.000000 |
| 50% | 5454.500000 | 1970.000000 | 51.371000 | 0.000000 | 0.000000 | 49.000000 | 175.500000 | 8.000000 | 68.000000 | 12.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 51.000000 | 1.000000 | 1.000000 | 397.000000 | 0.000000 | 2.000000 | 2.000000 |
| 75% | 8418.500000 | 1977.000000 | 68.487000 | 1.000000 | 1.000000 | 74.000000 | 505.000000 | 33.000000 | 232.250000 | 50.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 62.000000 | 1.000000 | 1.000000 | 1048.000000 | 2.000000 | 3.000000 | 2.000000 |
| max | 11191.000000 | 1996.000000 | 162.397000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 81.000000 | 3.000000 | 1.000000 | 2525.000000 | 3.000000 | 4.000000 | 4.000000 |
8 rows × 31 columns
print(ndata[ndata['Cluster']==0].describe())
print(ndata[ndata['Cluster']==1].describe())
print(ndata[ndata['Cluster']==2].describe())
print(ndata[ndata['Cluster']==3].describe())
print(ndata[ndata['Cluster']==4].describe())
spent Age EducationId HaseChild Income \
count 403.000000 403.000000 403.000000 403.000000 403.000000
mean 502.476427 54.223325 0.883375 0.848635 53.504251
std 131.225287 11.059979 1.016675 0.358850 12.071795
min 297.000000 29.000000 0.000000 0.000000 4.428000
25% 399.000000 46.000000 0.000000 1.000000 46.067500
50% 486.000000 54.000000 0.000000 1.000000 54.108000
75% 608.000000 64.000000 2.000000 1.000000 61.286000
max 751.000000 77.000000 3.000000 1.000000 86.836000
NumDealsPurchases Cluster
count 403.000000 403.0
mean 3.555831 0.0
std 2.343348 0.0
min 0.000000 0.0
25% 2.000000 0.0
50% 3.000000 0.0
75% 5.000000 0.0
max 13.000000 0.0
spent Age EducationId HaseChild Income \
count 265.000000 265.000000 265.000000 265.000000 265.000000
mean 1486.075472 53.147170 0.822642 0.298113 75.819570
std 140.314102 12.788083 1.023704 0.458295 13.291157
min 1250.000000 26.000000 0.000000 0.000000 2.447000
25% 1366.000000 43.000000 0.000000 0.000000 69.084000
50% 1482.000000 54.000000 0.000000 0.000000 75.114000
75% 1600.000000 64.000000 2.000000 1.000000 81.702000
max 1738.000000 78.000000 3.000000 1.000000 160.803000
NumDealsPurchases Cluster
count 265.000000 265.0
mean 1.833962 1.0
std 2.150312 0.0
min 0.000000 1.0
25% 1.000000 1.0
50% 1.000000 1.0
75% 2.000000 1.0
max 15.000000 1.0
spent Age EducationId HaseChild Income \
count 1008.000000 1008.000000 1008.000000 1008.000000 1008.000000
mean 90.167659 49.865079 1.020833 0.885913 35.030392
std 73.311883 11.045525 1.145156 0.318075 14.726231
min 5.000000 25.000000 0.000000 0.000000 1.730000
25% 38.000000 42.000000 0.000000 1.000000 25.717500
50% 63.000000 49.000000 1.000000 1.000000 34.421000
75% 122.000000 57.000000 2.000000 1.000000 42.740000
max 296.000000 81.000000 3.000000 1.000000 162.397000
NumDealsPurchases Cluster
count 1008.000000 1008.0
mean 2.073413 2.0
std 1.451147 0.0
min 0.000000 2.0
25% 1.000000 2.0
50% 2.000000 2.0
75% 3.000000 2.0
max 15.000000 2.0
spent Age EducationId HaseChild Income \
count 417.000000 417.000000 417.000000 417.000000 417.000000
mean 1004.647482 54.992806 0.880096 0.594724 67.909659
std 138.811102 11.732698 1.051561 0.491535 10.383615
min 756.000000 26.000000 0.000000 0.000000 33.051000
25% 894.000000 46.000000 0.000000 0.000000 61.014000
50% 1003.000000 55.000000 0.000000 1.000000 67.445000
75% 1127.000000 65.000000 2.000000 1.000000 75.154000
max 1245.000000 78.000000 3.000000 1.000000 102.692000
NumDealsPurchases Cluster
count 417.000000 417.0
mean 2.381295 3.0
std 1.960165 0.0
min 0.000000 3.0
25% 1.000000 3.0
50% 2.000000 3.0
75% 3.000000 3.0
max 15.000000 3.0
spent Age EducationId HaseChild Income \
count 119.000000 119.000000 119.000000 119.000000 119.000000
mean 1992.789916 51.117647 0.932773 0.151261 81.088462
std 188.376286 12.865126 0.963144 0.359818 8.439219
min 1743.000000 27.000000 0.000000 0.000000 61.839000
25% 1834.000000 41.000000 0.000000 0.000000 75.303000
50% 1947.000000 50.000000 1.000000 0.000000 81.169000
75% 2091.500000 61.500000 2.000000 0.000000 87.433500
max 2525.000000 80.000000 3.000000 1.000000 98.777000
NumDealsPurchases Cluster
count 119.000000 119.0
mean 1.176471 4.0
std 1.109642 0.0
min 0.000000 4.0
25% 1.000000 4.0
50% 1.000000 4.0
75% 1.000000 4.0
max 8.000000 4.0
df_copy.columns
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response', 'Age', 'ChildNumber',
'HaseChild', 'spent', 'EducationId', 'Marital_StatusId', 'Cluster'],
dtype='object')
plt.figure(figsize=(15,6))
sb.boxplot(data = df_copy[df_copy['Cluster']==0], x = 'Marital_Status', y = 'spent');
plt.xlabel('Marital Status')
plt.ylabel('Total spend')
plt.show()
plt.figure(figsize=(15,6))
sb.boxplot(data = df_copy[df_copy['Cluster']==0], x = 'Education', y = 'spent');
plt.xlabel('Education')
plt.ylabel('Total spend')
plt.show()
plt.figure(figsize=(15,6))
sb.boxplot(data = df_copy[df_copy['Cluster']==0], x = 'HaseChild', y = 'spent');
plt.xlabel('Has child or not')
plt.ylabel('Total spend')
plt.show()
intersetColumns
['spent', 'Age', 'EducationId', 'Marital_StatusId', 'HaseChild', 'Income', 'NumDealsPurchases', 'ChildNumber', 'Response', 'Cluster']
import statsmodels.api as sm
features=['spent','Age','Income','NumDealsPurchases','ChildNumber','Response','Cluster']
new_db=df_copy[features].copy()
dummy_marital=pd.get_dummies(df_copy['Marital_Status'])
dummy_HasChild=pd.get_dummies(df_copy['HaseChild'])
dummy_Education=pd.get_dummies(df_copy['Education'])
new_db=new_db.join(dummy_marital).join(dummy_Education).join(dummy_HasChild)
new_db['intercept']=1
C1_new_df=new_db[new_db['Cluster']==0]
lm = sm.Logit(C1_new_df['Response'], C1_new_df[['intercept','Divorced', 'Married', 'Single', 'Together', 'Widow', 'Graduation', 'Master', 'PhD', 'UnGraduation']])
results = lm.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.350171
Iterations: 35
C:\Users\moham\Anaconda3\lib\site-packages\statsmodels\base\model.py:568: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
results.summary2()
| Model: | Logit | Pseudo R-squared: | 0.078 |
| Dependent Variable: | Response | AIC: | 298.2380 |
| Date: | 2021-12-11 20:44 | BIC: | 330.2295 |
| No. Observations: | 403 | Log-Likelihood: | -141.12 |
| Df Model: | 7 | LL-Null: | -153.05 |
| Df Residuals: | 395 | LLR p-value: | 0.0012046 |
| Converged: | 0.0000 | Scale: | 1.0000 |
| No. Iterations: | 35.0000 |
| Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -1.4815 | 7534503.1177 | -0.0000 | 1.0000 | -14767356.2337 | 14767353.2706 |
| Divorced | -0.0894 | nan | nan | nan | nan | nan |
| Married | -0.5199 | nan | nan | nan | nan | nan |
| Single | 0.2247 | nan | nan | nan | nan | nan |
| Together | -1.2132 | nan | nan | nan | nan | nan |
| Widow | 0.1162 | nan | nan | nan | nan | nan |
| Graduation | -0.1022 | nan | nan | nan | nan | nan |
| Master | -1.1824 | nan | nan | nan | nan | nan |
| PhD | 0.5672 | nan | nan | nan | nan | nan |
| UnGraduation | -0.7642 | nan | nan | nan | nan | nan |
C2_new_df=new_db[new_db['Cluster']==1]
lm = sm.Logit(C2_new_df['Response'], C2_new_df[['intercept','Divorced', 'Married', 'Single', 'Together', 'Widow', 'Graduation', 'Master', 'PhD', 'UnGraduation']])
results = lm.fit()
results.summary2()
Optimization terminated successfully.
Current function value: 0.530491
Iterations 25
| Model: | Logit | Pseudo R-squared: | 0.075 |
| Dependent Variable: | Response | AIC: | 297.1600 |
| Date: | 2021-12-11 20:47 | BIC: | 325.7978 |
| No. Observations: | 265 | Log-Likelihood: | -140.58 |
| Df Model: | 7 | LL-Null: | -151.96 |
| Df Residuals: | 257 | LLR p-value: | 0.0018697 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 25.0000 |
| Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -0.9005 | nan | nan | nan | nan | nan |
| Divorced | -0.0233 | 4788001.2432 | -0.0000 | 1.0000 | -9384310.0178 | 9384309.9712 |
| Married | -0.6082 | 4514170.8632 | -0.0000 | 1.0000 | -8847612.9201 | 8847611.7036 |
| Single | 0.6186 | 4788001.2432 | 0.0000 | 1.0000 | -9384309.3760 | 9384310.6131 |
| Together | -1.0770 | 5046996.4560 | -0.0000 | 1.0000 | -9891932.3609 | 9891930.2069 |
| Widow | 0.1894 | 4788001.2432 | 0.0000 | 1.0000 | -9384309.8051 | 9384310.1839 |
| Graduation | 0.2373 | 3073135.0738 | 0.0000 | 1.0000 | -6023233.8270 | 6023234.3016 |
| Master | -0.2634 | 3073135.0738 | -0.0000 | 1.0000 | -6023234.3276 | 6023233.8009 |
| PhD | 0.2675 | 3073135.0738 | 0.0000 | 1.0000 | -6023233.7968 | 6023234.3318 |
| UnGraduation | -1.1420 | 3073135.0738 | -0.0000 | 1.0000 | -6023235.2062 | 6023232.9223 |
C3_new_df=new_db[new_db['Cluster']==2]
lm = sm.Logit(C3_new_df['Response'], C3_new_df[['intercept','Divorced', 'Married', 'Single', 'Together', 'Widow', 'Graduation', 'Master', 'PhD', 'UnGraduation']])
results = lm.fit()
results.summary2()
Optimization terminated successfully.
Current function value: 0.292854
Iterations 14
| Model: | Logit | Pseudo R-squared: | 0.027 |
| Dependent Variable: | Response | AIC: | 606.3932 |
| Date: | 2021-12-11 20:47 | BIC: | 645.7190 |
| No. Observations: | 1008 | Log-Likelihood: | -295.20 |
| Df Model: | 7 | LL-Null: | -303.29 |
| Df Residuals: | 1000 | LLR p-value: | 0.023478 |
| Converged: | 1.0000 | Scale: | 1.0000 |
| No. Iterations: | 14.0000 |
| Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -1.5415 | 4684857.4137 | -0.0000 | 1.0000 | -9182153.3451 | 9182150.2622 |
| Divorced | -0.3089 | nan | nan | nan | nan | nan |
| Married | -0.7122 | nan | nan | nan | nan | nan |
| Single | 0.1562 | nan | nan | nan | nan | nan |
| Together | -0.5934 | nan | nan | nan | nan | nan |
| Widow | -0.0830 | nan | nan | nan | nan | nan |
| Graduation | -0.5299 | nan | nan | nan | nan | nan |
| Master | 0.0156 | nan | nan | nan | nan | nan |
| PhD | -0.2554 | nan | nan | nan | nan | nan |
| UnGraduation | -0.7718 | nan | nan | nan | nan | nan |
C4_new_df=new_db[new_db['Cluster']==3]
lm = sm.Logit(C4_new_df['Response'], C4_new_df[['intercept','Divorced', 'Married', 'Single', 'Together', 'Widow', 'Graduation', 'Master', 'PhD', 'UnGraduation']])
results = lm.fit()
results.summary2()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.368433
Iterations: 35
C:\Users\moham\Anaconda3\lib\site-packages\statsmodels\base\model.py:568: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
| Model: | Logit | Pseudo R-squared: | 0.096 |
| Dependent Variable: | Response | AIC: | 323.2727 |
| Date: | 2021-12-11 20:47 | BIC: | 355.5374 |
| No. Observations: | 417 | Log-Likelihood: | -153.64 |
| Df Model: | 7 | LL-Null: | -169.99 |
| Df Residuals: | 409 | LLR p-value: | 2.9979e-05 |
| Converged: | 0.0000 | Scale: | 1.0000 |
| No. Iterations: | 35.0000 |
| Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -1.0440 | nan | nan | nan | nan | nan |
| Divorced | 0.5933 | 10572465.5492 | 0.0000 | 1.0000 | -20721651.1109 | 20721652.2975 |
| Married | -0.7162 | 10572465.5492 | -0.0000 | 1.0000 | -20721652.4204 | 20721650.9880 |
| Single | -0.3881 | 10572465.5492 | -0.0000 | 1.0000 | -20721652.0923 | 20721651.3161 |
| Together | -0.7827 | 10572465.5492 | -0.0000 | 1.0000 | -20721652.4869 | 20721650.9215 |
| Widow | 0.2497 | 10572465.5492 | 0.0000 | 1.0000 | -20721651.4545 | 20721651.9539 |
| Graduation | -1.0926 | 11757550.2789 | -0.0000 | 1.0000 | -23044376.1858 | 23044374.0006 |
| Master | -0.1985 | 11757550.2789 | -0.0000 | 1.0000 | -23044375.2917 | 23044374.8946 |
| PhD | 0.3858 | 11757550.2789 | 0.0000 | 1.0000 | -23044374.7074 | 23044375.4789 |
| UnGraduation | -0.1386 | 11757550.2789 | -0.0000 | 1.0000 | -23044375.2318 | 23044374.9545 |
lm = sm.Logit(new_db['Response'], new_db[['intercept','Divorced', 'Married', 'Single', 'Together', 'Widow', 'Graduation', 'Master', 'PhD', 'UnGraduation']])
results = lm.fit()
Optimization terminated successfully.
Current function value: 0.407993
Iterations 15
results.summary()
| Dep. Variable: | Response | No. Observations: | 2212 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 2203 |
| Method: | MLE | Df Model: | 8 |
| Date: | Sat, 11 Dec 2021 | Pseudo R-squ.: | 0.03695 |
| Time: | 20:20:13 | Log-Likelihood: | -902.48 |
| converged: | True | LL-Null: | -937.11 |
| Covariance Type: | nonrobust | LLR p-value: | 6.897e-12 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| intercept | -1.1302 | 2.15e+06 | -5.25e-07 | 1.000 | -4.22e+06 | 4.22e+06 |
| Divorced | 0.0241 | 2.63e+06 | 9.15e-09 | 1.000 | -5.16e+06 | 5.16e+06 |
| Married | -0.6846 | 2.63e+06 | -2.6e-07 | 1.000 | -5.16e+06 | 5.16e+06 |
| Single | 0.1609 | 2.63e+06 | 6.11e-08 | 1.000 | -5.16e+06 | 5.16e+06 |
| Together | -0.7691 | 2.63e+06 | -2.92e-07 | 1.000 | -5.16e+06 | 5.16e+06 |
| Widow | 0.1385 | 2.63e+06 | 5.26e-08 | 1.000 | -5.16e+06 | 5.16e+06 |
| Graduation | -0.3603 | 2.73e+06 | -1.32e-07 | 1.000 | -5.35e+06 | 5.35e+06 |
| Master | -0.2014 | 2.73e+06 | -7.38e-08 | 1.000 | -5.35e+06 | 5.35e+06 |
| PhD | 0.1732 | 2.73e+06 | 6.35e-08 | 1.000 | -5.35e+06 | 5.35e+06 |
| UnGraduation | -0.7417 | 2.73e+06 | -2.72e-07 | 1.000 | -5.35e+06 | 5.35e+06 |
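The huge (or NaN) standard errors in all of the logit fits above are the classic dummy-variable trap: each full block of dummies (Marital_Status, Education) sums to 1 across its columns, duplicating the intercept and making the design matrix singular. Dropping one level per category removes the collinearity, and the dropped level becomes the baseline. A minimal sketch:

```python
import pandas as pd

# Sketch: drop_first=True drops one dummy per category, avoiding collinearity.
status = pd.Series(["Married", "Single", "Widow", "Married"])
full = pd.get_dummies(status)                      # 3 columns; every row sums to 1
reduced = pd.get_dummies(status, drop_first=True)  # 2 columns; 'Married' is the baseline
print(full.shape[1], reduced.shape[1])  # 3 2
```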